Hadoop: A Highly Available and Secure Enterprise Data Warehousing Solution

www.edureka.co/r-for-analytics
www.edureka.co/hadoop-admin

By the end of this webinar, we will know about:
• What Big Data is
• Why enterprises care about Big Data
• Why your DWH needs Hadoop
• Security in Hadoop
• How Hadoop maintains high availability
• Data warehousing tools in Hadoop
Agenda

What Is Big Data?

What Is Wrong with Our Traditional DWH Solutions?

• Storing unstructured data such as images and video
• Processing images and video
• Storing and processing other large files
  – PDFs, Excel files
• Processing large blocks of natural-language text
  – Blog posts, job ads, product descriptions
• Processing semi-structured data
  – CSV, JSON, XML, log files
  – Sensor data
When Does an RDBMS Make No Sense?

• Ad hoc, exploratory analytics
• Integrating data from external sources
• Data cleanup tasks
• Very advanced analytics (machine learning)
When Does an RDBMS Make No Sense?

• Big Data is:
– Unstructured
– Unprocessed
– Un-aggregated
– Un-filtered
– Repetitive
– Low quality
– And generally messy.
Oh, and there is a lot of it.
Big Problems with Big Data

• Storage capacity
• Storage throughput
• Pipeline throughput
• Processing power
• Parallel processing
• System integration
• Data analysis
What these challenges call for:
• Scalable storage
• Massively parallel processing
• Ready-to-use tools
Technical Challenges

Too many channels for data
Technical Challenges

Why Do Enterprises Care About Big Data?

You said RDBMS does not have a solution for Big Data.
Then who does?

I have the solution for the Big Data problem: Hadoop
Hadoop: The Savior

How Hadoop Differs from RDBMS
Hadoop can store all types of data, which gives you the flexibility to analyze any of it.
You can drill down into Big Data to find even rare insights, which was not possible earlier.

Load the data first, then do whatever you want with it.
This is possible because of cheap storage and the distributed HDFS.
• ETL (traditional DWH)
  – Before loading, you must transform the data into a particular format
  – This restricts the types of data that can be stored
• ELT (Hadoop)
  – There is no need to transform the data beforehand
  – You can have all kinds of data on board
  – Freedom to work with all data
Hadoop Is the New DWH Solution

Hadoop is the new data warehouse for all kinds of BI requirements.
Hadoop Does ELT, Not ETL
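
The ETL/ELT contrast can be illustrated with a minimal pure-Python sketch. It uses no Hadoop API: the sample records, the helpers `etl_load`, `elt_load`, `read_amounts`, and `count_errors`, and the in-memory lists standing in for a warehouse table and for HDFS are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical raw inputs: one JSON record, one CSV line, one free-text log line.
RAW_RECORDS = [
    '{"user": "alice", "amount": 30}',
    'bob,12',
    'ERROR disk full on node7',
]


def etl_load(raw_records):
    """ETL: transform before loading; anything that does not fit the schema is rejected."""
    table, rejected = [], []
    for line in raw_records:
        try:
            rec = json.loads(line)
            table.append({"user": rec["user"], "amount": int(rec["amount"])})
        except (ValueError, KeyError, TypeError):
            rejected.append(line)
    return table, rejected


def elt_load(raw_records):
    """ELT: load everything as-is; no schema is applied at load time."""
    return list(raw_records)


def read_amounts(store):
    """Schema applied at read time: this query decides how to interpret each raw line."""
    amounts = {}
    for line in store:
        if line.startswith("{"):
            rec = json.loads(line)
            amounts[rec["user"]] = int(rec["amount"])
        elif "," in line:
            user, amount = line.split(",", 1)
            amounts[user] = int(amount)
    return amounts


def count_errors(store):
    """A second, unrelated question asked of the same raw data."""
    return sum(1 for line in store if line.startswith("ERROR"))
```

In this sketch the ETL path keeps only the record that already matches the target format, while the ELT path keeps all three lines and lets each later query pick out what it needs.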

Core Features of Hadoop

Hadoop Is Fault Tolerant And Super Consistent

Maintaining High Availability (HA)
In distributed computing, failure is the norm, so YARN must provide an acceptable level of availability.
• NameNode: no horizontal scaling
• NameNode: no high availability
[Diagram: the client gets block locations from the NameNode (namespace and block management), then reads data from the DataNodes.]

• Secondary NameNode:
  – "Not a hot standby" for the NameNode
  – Connects to the NameNode every hour*
  – Housekeeping and backup of NameNode metadata
  – Saved metadata can be used to rebuild a failed NameNode
[Diagram: the NameNode (single point of failure) hands its metadata to the Secondary NameNode, which says: "You give me metadata every hour, I will make it secure."]
NameNode – Single Point of Failure
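
The Secondary NameNode's role can be sketched as a toy simulation. This is not Hadoop code: the classes, method names, and the dictionary standing in for the namespace image are assumptions made for illustration. In Hadoop 2 the checkpoint interval is configurable through `dfs.namenode.checkpoint.period`, whose default of 3600 seconds matches the "every hour" above.

```python
def apply_edits(image, edits):
    """Replay a list of edits on top of a namespace image and return the result."""
    result = dict(image)
    for op, path, blocks in edits:
        if op == "create":
            result[path] = blocks
        elif op == "delete":
            result.pop(path, None)
    return result


class NameNode:
    """Toy NameNode: the namespace is a checkpointed image plus a log of later edits."""

    def __init__(self):
        self.fsimage = {}    # last checkpointed namespace: path -> block count
        self.edit_log = []   # edits made since the last checkpoint

    def create_file(self, path, blocks):
        self.edit_log.append(("create", path, blocks))

    def delete_file(self, path):
        self.edit_log.append(("delete", path, None))

    def namespace(self):
        """Current view: the image with all pending edits replayed on top."""
        return apply_edits(self.fsimage, self.edit_log)


class SecondaryNameNode:
    """Toy Secondary NameNode: merges edits into a new image and keeps a copy."""

    def __init__(self):
        self.saved_image = {}

    def checkpoint(self, namenode):
        merged = apply_edits(namenode.fsimage, namenode.edit_log)
        self.saved_image = dict(merged)  # backup copy held by the secondary
        namenode.fsimage = merged        # NameNode gets the compacted image back
        namenode.edit_log = []           # merged edits no longer need replaying

    def rebuild_namenode(self):
        """Build a replacement NameNode from the last checkpoint only."""
        replacement = NameNode()
        replacement.fsimage = dict(self.saved_image)
        return replacement
```

A NameNode rebuilt this way only knows what existed at the last checkpoint, so edits made after it are missing. That gap is one way to see why the Secondary NameNode is not a hot standby.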

[Diagram: a Hadoop 2.0 cluster in which a client talks to worker nodes, each running a DataNode and a Node Manager that hosts Containers and App Masters.]
• HDFS (NameNode High Availability): Active NameNode, Standby NameNode, Secondary NameNode, DataNodes
• YARN (Next Generation MapReduce): Resource Manager, Node Managers, Containers, App Masters
• Shared edit logs: all namespace edits are logged to shared NFS storage by a single writer (fencing)
• The Standby NameNode reads the edit logs and applies them to its own namespace
HDFS High Availability: http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
Hadoop 2.0 Cluster Architecture - HA
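
The active/standby arrangement in this architecture can be sketched as a toy model. Nothing below is a Hadoop API: the class names, `grant_writer`, `catch_up`, and `failover` are hypothetical, and the simple writer-name check is only a stand-in for fencing, not how Hadoop implements it.

```python
class FencedError(Exception):
    """Raised when a node that no longer holds the writer role tries to log an edit."""


class SharedEditLog:
    """Stand-in for the shared storage: an append-only list with one allowed writer."""

    def __init__(self):
        self.entries = []
        self.writer = None

    def grant_writer(self, node_name):
        # Granting the role to a new node implicitly fences the previous writer.
        self.writer = node_name

    def append(self, node_name, edit):
        if node_name != self.writer:
            raise FencedError(f"{node_name} is fenced; current writer is {self.writer}")
        self.entries.append(edit)


class NameNode:
    def __init__(self, name, shared_log):
        self.name = name
        self.log = shared_log
        self.namespace = set()
        self.applied = 0  # number of shared-log entries already applied locally

    def mkdir(self, path):
        """Active-side operation: log the edit first, then apply it locally."""
        self.log.append(self.name, ("mkdir", path))
        self.catch_up()

    def catch_up(self):
        """Standby-side operation: read new edits and apply them to the local namespace."""
        for op, path in self.log.entries[self.applied:]:
            if op == "mkdir":
                self.namespace.add(path)
        self.applied = len(self.log.entries)


def failover(shared_log, standby):
    """Promote the standby: take the writer role, then replay edits not yet applied."""
    shared_log.grant_writer(standby.name)
    standby.catch_up()
```

Because the standby keeps replaying the shared log, it already holds nearly the full namespace when it is promoted. The single-writer check keeps a stale active node from adding edits after failover.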

Demo: Achieving HDFS and YARN High Availability

Hadoop is Secure

Security
• Service-level authorization and web proxy capabilities in YARN
• Access Control Lists (ACLs): the Hadoop Distributed File System (HDFS) implements a permissions model for files and directories that shares much of the POSIX model
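
The POSIX-style part of that permissions model can be sketched as follows. This is a simplified illustration, not HDFS code: `Inode` and `is_allowed` are hypothetical names, and the sketch leaves out the superuser bypass and extended ACL entries.

```python
from dataclasses import dataclass

# Permission bits within one class, as in POSIX mode strings (r=4, w=2, x=1).
READ, WRITE, EXECUTE = 4, 2, 1


@dataclass
class Inode:
    owner: str
    group: str
    mode: int  # octal mode, e.g. 0o640


def is_allowed(inode, user, user_groups, wanted):
    """Pick exactly one class (owner, then group, then other) and test its bits."""
    if user == inode.owner:
        bits = (inode.mode >> 6) & 0o7
    elif inode.group in user_groups:
        bits = (inode.mode >> 3) & 0o7
    else:
        bits = inode.mode & 0o7
    return (bits & wanted) == wanted
```

For a file with mode `0o640`, the owner may read and write, members of the file's group may only read, and everyone else is denied.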

Security – Simple Flow
• Security risks
  – Insufficient authentication
      – Users and services are not authenticated
  – No privacy and no integrity
      – Insecure network transport
      – No message-level security
  – Arbitrary code execution
      – No user verification for MapReduce code execution, so malicious users could submit a job
[Diagram: a Client submits work to the Job Tracker; Task Trackers run Tasks that access HDFS.]

Managing users, permissions, quotas, etc.
Checking Resource Usage and User Permissions

Hadoop provides a traditional SQL interface as well as a NoSQL interface for data storage

Hive

Hive Architecture

HBase and Its Architecture

Your feedback is vital to us, be it a compliment, a suggestion, or a complaint. It helps us make your experience better!
Please spare a few minutes to take the survey after the webinar.
Survey
