2. Slide 2 www.edureka.co/hadoop-admin
By the end of this webinar, we will know about:
What is Big Data
Why enterprises care about Big Data
Why your DWH needs Hadoop
Security in Hadoop
How Hadoop maintains high availability
Data warehousing tools in Hadoop
Agenda
5. Slide 5
What Is Wrong with Our Traditional DWH Solutions?
6. Slide 6
Storing unstructured data such as images and video
Processing images and video
Storing and processing other large files
PDFs, Excel files
Processing large blocks of natural language text
Blog posts, job ads, product descriptions
Processing semi-structured data
CSV, JSON, XML, log files
Sensor data
When Does an RDBMS Make No Sense?
7. Slide 7
Ad-hoc, exploratory analytics
Integrating data from external sources
Data cleanup tasks
Very advanced analytics (machine learning)
When Does an RDBMS Make No Sense?
8. Slide 8
It is:
– Unstructured
– Unprocessed
– Un-aggregated
– Un-filtered
– Repetitive
– Low quality
– And generally messy.
Oh, and there is a lot of it.
Big Problems with Big Data
9. Slide 9
Storage capacity
Storage throughput
Pipeline throughput
Processing power
Parallel processing
System Integration
Data Analysis
Scalable storage
Massive Parallel Processing
Ready to use tools
Technical Challenges
10. Slide 10
Too many channels for data
Technical Challenges
11. Slide 11
Why Enterprises Care about Big Data
14. Slide 14
You said an RDBMS does not have a solution for Big Data.
Then who does?
15. Slide 15
I have the solution for the Big Data problem
Hadoop
Hadoop : The Savior
16. Slide 16
How Hadoop Differs from RDBMS
Hadoop can store all types of data, which gives you the flexibility to analyze any of it.
You can drill down into big data to find even rare insights, which was not possible earlier.
18. Slide 18
Load the data first, then do whatever you want with it.
This is possible because of cheap storage and the distributed HDFS.
Hadoop Is The New DWH Solution
Traditional DWH:
• This is ETL
• Before loading, you must transform the data into a particular format
• This puts a restriction on the type of data that can be stored
Hadoop:
• This is ELT
• There is no need to transform the data beforehand
• You can have all kinds of data on board
• Freedom to work with all data
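The contrast between ETL and ELT can be sketched in a few lines of Python. This is a minimal illustration, not Hadoop code: the sample records, the `etl_load`, `elt_load`, and `total_amount` helpers, and the in-memory lists standing in for a warehouse table and for HDFS are all assumptions made for the example.

```python
import json

# Raw input: a mix of well-formed rows and data that does not fit a fixed schema.
RAW_RECORDS = [
    '{"user": "alice", "amount": 30}',
    '{"user": "bob", "amount": 12, "note": "free-form text"}',
    'plain log line without structure',
]

def etl_load(raw_records):
    """ETL: transform first; only rows matching the fixed schema are stored."""
    table = []
    for line in raw_records:
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # does not fit the target format, so it is dropped
        if set(row) == {"user", "amount"}:
            table.append((row["user"], row["amount"]))
    return table

def elt_load(raw_records):
    """ELT: store everything as-is; nothing is transformed before loading."""
    return list(raw_records)

def total_amount(stored_lines):
    """Transform at read time: parse what we can, skip what we cannot."""
    total = 0
    for line in stored_lines:
        try:
            total += json.loads(line).get("amount", 0)
        except json.JSONDecodeError:
            pass
    return total

warehouse = etl_load(RAW_RECORDS)   # only the row that matches the schema
data_lake = elt_load(RAW_RECORDS)   # all three raw records
```

In this sketch the ETL path keeps one record out of three, while the ELT path keeps all of them and can still answer the same question later by parsing at read time.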
19. Slide 19
Hadoop is the new data warehouse for all kinds of BI requirements.
Hadoop Does ELT Not ETL
21. Slide 21
Hadoop Is Fault Tolerant And Super Consistent
22. Slide 22
Maintaining High Availability (HA)
In distributed computing, failure is the norm, which means YARN should provide an acceptable level of availability.
NameNode - No Horizontal Scale
NameNode - No High Availability
[Diagram: the client gets block locations from the NameNode, which handles the namespace (NS) and block management, and reads data from the DataNodes.]
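The read path on this slide (the client gets block locations from the NameNode, then reads data from the DataNodes) can be sketched as a toy model. The classes, method names, block ids, and node names below are hypothetical and are not the HDFS client API; the sketch only shows why every read depends on the single NameNode.

```python
class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        self.namespace = {}        # path -> ordered list of block ids
        self.block_locations = {}  # block id -> list of DataNode names

    def add_file(self, path, placements):
        self.namespace[path] = [block_id for block_id, _ in placements]
        for block_id, nodes in placements:
            self.block_locations[block_id] = list(nodes)

    def get_block_locations(self, path):
        return [(b, self.block_locations[b]) for b in self.namespace[path]]


class DataNode:
    """Holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes


def read_file(name_node, data_nodes, path):
    """Client: ask the NameNode for locations, then read from the DataNodes."""
    content = b""
    for block_id, nodes in name_node.get_block_locations(path):
        content += data_nodes[nodes[0]].blocks[block_id]  # first replica
    return content


nn = NameNode()
dns = {name: DataNode(name) for name in ("dn1", "dn2", "dn3")}
dns["dn1"].blocks["blk_1"] = b"hello "
dns["dn2"].blocks["blk_1"] = b"hello "   # second replica of the same block
dns["dn3"].blocks["blk_2"] = b"hadoop"
nn.add_file("/demo.txt", [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn3"])])
```

The data itself is replicated across DataNodes, but the mapping from file to blocks lives only in the one NameNode, so losing it makes the file unreadable even though every block still exists.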
23. Slide 23
Secondary NameNode:
"Not a hot standby" for the NameNode
Connects to the NameNode every hour*
Housekeeping, backup of NameNode metadata
Saved metadata can rebuild a failed NameNode
[Diagram: the NameNode, a single point of failure, sends its metadata to the Secondary NameNode, which says: "You give me metadata every hour, I will make it secure."]
NameNode – Single Point of Failure
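Why the Secondary NameNode is a backup and not a hot standby can be illustrated with a toy checkpoint model. Everything here is an assumption for the sake of the example: the `ToyNameNode` and `ToySecondaryNameNode` classes, the tuple-based edit log, and the dictionaries standing in for the saved metadata image are not Hadoop data structures.

```python
import copy


class ToyNameNode:
    def __init__(self):
        self.fsimage = {}    # last saved snapshot of the namespace
        self.edit_log = []   # changes made since that snapshot
        self.namespace = {}  # live, in-memory view

    def create(self, path, size):
        self.edit_log.append(("create", path, size))
        self.namespace[path] = size

    def delete(self, path):
        self.edit_log.append(("delete", path))
        self.namespace.pop(path, None)


def apply_edits(image, edits):
    """Replay logged edits on top of a snapshot and return the merged result."""
    result = copy.deepcopy(image)
    for edit in edits:
        if edit[0] == "create":
            result[edit[1]] = edit[2]
        elif edit[0] == "delete":
            result.pop(edit[1], None)
    return result


class ToySecondaryNameNode:
    def __init__(self):
        self.saved_image = {}

    def checkpoint(self, name_node):
        """Periodic housekeeping: merge the edits into a fresh snapshot."""
        merged = apply_edits(name_node.fsimage, name_node.edit_log)
        self.saved_image = copy.deepcopy(merged)
        name_node.fsimage = merged
        name_node.edit_log = []


nn = ToyNameNode()
snn = ToySecondaryNameNode()
nn.create("/a", 1)
nn.create("/b", 2)
snn.checkpoint(nn)   # the periodic (for example hourly) checkpoint
nn.delete("/a")
nn.create("/c", 3)   # both changes happen after the last checkpoint

# The NameNode fails; a replacement is rebuilt from the saved metadata.
rebuilt = ToyNameNode()
rebuilt.fsimage = copy.deepcopy(snn.saved_image)
rebuilt.namespace = copy.deepcopy(snn.saved_image)
```

The rebuilt NameNode knows about `/a` and `/b` but not about the changes made after the checkpoint, and rebuilding it is a manual recovery step, which is why the NameNode remains a single point of failure in this setup.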
24. Slide 24
[Diagram: HDFS with NameNode High Availability alongside YARN (Next Generation MapReduce). The client talks to the Active NameNode; a Standby NameNode and a Secondary NameNode are also shown. The Resource Manager oversees the worker nodes, each running a Node Manager and a DataNode with Containers and App Masters.]
Shared edit logs: all namespace edits are logged to shared NFS storage; single writer (fencing).
The Standby NameNode reads the edit logs and applies them to its own namespace.
HDFS HIGH AVAILABILITY
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
Hadoop 2.0 Cluster Architecture - HA
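The shared edit log with a single writer can be sketched as a toy model. The classes below are hypothetical, and the epoch counter used to reject a stale writer is a simplification chosen for this example; it is not a description of how NFS-based HA actually fences the old Active NameNode.

```python
class FencedError(Exception):
    """Raised when a NameNode that lost the writer role tries to write."""


class SharedEditLog:
    def __init__(self):
        self.entries = []
        self.writer_epoch = 0  # only the holder of this epoch may append

    def become_writer(self):
        self.writer_epoch += 1
        return self.writer_epoch

    def append(self, epoch, edit):
        if epoch != self.writer_epoch:
            raise FencedError("stale writer rejected")
        self.entries.append(edit)


class ToyNameNode:
    def __init__(self, log):
        self.log = log
        self.namespace = {}
        self.applied = 0   # number of log entries already applied
        self.epoch = None  # None while in standby

    def catch_up(self):
        """Standby behaviour: read new edits and apply them locally."""
        for op, path in self.log.entries[self.applied:]:
            if op == "mkdir":
                self.namespace[path] = {}
        self.applied = len(self.log.entries)

    def activate(self):
        self.catch_up()
        self.epoch = self.log.become_writer()

    def mkdir(self, path):
        self.log.append(self.epoch, ("mkdir", path))
        self.catch_up()


log = SharedEditLog()
active, standby = ToyNameNode(log), ToyNameNode(log)
active.activate()
active.mkdir("/data")
standby.catch_up()    # the standby tails the shared log
standby.activate()    # failover: the old active is now fenced
try:
    active.mkdir("/late")
    fenced = False
except FencedError:
    fenced = True
```

Because the standby has already applied every logged edit, it can take over with an up-to-date namespace, and the single-writer rule keeps the old active from corrupting the log after failover.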
27. Slide 27
Security
Service-level authorization and web proxy capabilities in YARN.
Access Control Lists (ACLs): The Hadoop Distributed File System (HDFS) implements a permissions model for files and directories that shares much of the POSIX model.
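A POSIX-style permission check of the kind this slide refers to can be sketched as follows. This is a simplified model under stated assumptions: the `is_allowed` function and its parameters are hypothetical, and it covers only the owner, group, and other classes, ignoring the superuser and any extended ACL entries.

```python
READ, WRITE, EXECUTE = 4, 2, 1


def is_allowed(mode, owner, group, user, user_groups, wanted):
    """Check one access against a POSIX-style mode such as 0o640."""
    if user == owner:
        bits = (mode >> 6) & 7   # owner class
    elif group in user_groups:
        bits = (mode >> 3) & 7   # group class
    else:
        bits = mode & 7          # everyone else
    return (bits & wanted) == wanted


# A file with mode 640, owned by user "hdfs" and group "analysts".
MODE, OWNER, GROUP = 0o640, "hdfs", "analysts"

owner_can_write = is_allowed(MODE, OWNER, GROUP, "hdfs", set(), WRITE)
member_can_read = is_allowed(MODE, OWNER, GROUP, "ann", {"analysts"}, READ)
member_can_write = is_allowed(MODE, OWNER, GROUP, "ann", {"analysts"}, WRITE)
guest_can_read = is_allowed(MODE, OWNER, GROUP, "guest", {"visitors"}, READ)
```

Only one class is consulted per request: the owner's bits if the user owns the file, otherwise the group's bits if the user belongs to the file's group, otherwise the bits for everyone else.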
28. Slide 28
Security – Simple Flow
Security Risks
Insufficient Authentication
– Users and services are not authenticated
No Privacy and No Integrity
– Insecure network transport
– No message-level security
Arbitrary Code Execution
– No user verification for MapReduce code execution; malicious users could submit a job
[Diagram: Client, Job Tracker, and two Task Trackers, each running a Task against HDFS.]
37. Slide 37
Your feedback is vital for us, be it a compliment, a suggestion, or a complaint. It helps us make your experience better!
Please spare a few minutes to take the survey after the webinar.
Survey
Editor's Notes
Big data is not called big data because it fits well onto a thumb drive.
It requires a lot of storage, partly because there is a lot of it and partly because it is unstructured, unprocessed, un-aggregated, repetitive, and generally messy.