Big Data and Hadoop

Presenter
Rajkumar Singh
http://rajkrrsingh.blogspot.com/
http://in.linkedin.com/in/rajkrrsingh
A brief introduction to the components of the Hadoop ecosystem.

Published in: Education, Technology, Business
Transcript of "Big Data and Hadoop Ecosystem"

  1. Big Data and Hadoop. Presenter: Rajkumar Singh (http://rajkrrsingh.blogspot.com/, http://in.linkedin.com/in/rajkrrsingh)
  2. Big Data and Hadoop: Introduction. Big data is characterized by the three Vs: Volume, Variety, Velocity. The data may be structured, semi-structured, or unstructured. Typical sources: Facebook, Google Plus, Twitter, LinkedIn, stock exchanges, healthcare, telecom, mobile devices, GPS, security infrastructure.
  3. The Problem: e.g. the stock market.
  4. The Solution (Hadoop Evolution), contrasted with the traditional approach.
  5. Data volumes have grown from GB to TB to PB, and onward toward ZB, so processing at this scale with a traditional RDBMS is impractical.
  6. Challenges in Big Data: • Storage at petabyte scale • Processing in a timely manner • Variety of data (structured, semi-structured, unstructured) • Cost
  7. To overcome these Big Data challenges, Hadoop evolved: • Cost effective: runs on commodity hardware • Big clusters (1000+ nodes) providing both storage and processing • Parallel processing via MapReduce • Big storage: usable capacity ≈ disk per node × number of nodes / replication factor • Automatic failover • Data distribution • The MapReduce framework • Moving code to the data rather than data to the code • Heterogeneous hardware (IBM, HP, AIX, Oracle machines of any memory and CPU configuration) • Scalable
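The "big storage" rule of thumb on this slide can be sketched as a quick calculation. The disk size per node is an illustrative assumption, not a figure from the deck; the node count and replication factor follow the slide's example.

```shell
# Usable capacity = raw disk per node * number of nodes / replication factor.
# DISK_PER_NODE_TB is an assumed example value.
DISK_PER_NODE_TB=4        # assumed raw disk per node, in TB
NODES=1000                # "big cluster" node count from the slide
RF=3                      # default HDFS replication factor
USABLE_TB=$(( DISK_PER_NODE_TB * NODES / RF ))
echo "Usable capacity: ${USABLE_TB} TB"
```

With replication factor 3, roughly two thirds of the raw disk is spent on redundancy, which is the price of automatic failover on commodity hardware.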
  8. 8. Typical Hadoop Infrastructure
  9. What is Hadoop? A Java framework for processing enormous amounts of data. Hadoop core: • HDFS • a programming construct (MapReduce)
  10. 10. HDFS
  11. 11. Processing Framework (Mapreduce)
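The MapReduce flow can be loosely imitated with a Unix pipeline — a common teaching analogy, not anything from the deck: splitting input into words is the map step, `sort` plays the role of the shuffle (bringing identical keys together), and `uniq -c` is the reduce.

```shell
# Word count, MapReduce-style, as a pipeline (an analogy, not Hadoop itself):
#   map:     emit one word (key) per line
#   shuffle: sort groups identical keys together
#   reduce:  uniq -c counts each key's occurrences
printf 'big data and hadoop and big data\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
```

The pipeline's stages are independent processes connected by streams, which is the same reason MapReduce stages parallelize well across a cluster.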
  12. 12. Hadoop Ecosystem
  13. Hadoop Sub-Projects • Hadoop Common: the common utilities that support the other Hadoop subprojects. • Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data. • Hadoop MapReduce: a software framework for distributed processing of large data sets on compute clusters. Other Hadoop-related projects at Apache include: • Avro™: a data serialization system. • Cassandra™: a scalable multi-master database with no single points of failure. • Chukwa™: a data collection system for managing large distributed systems. • HBase™: a scalable, distributed database that supports structured data storage for large tables. • Hive™: a data warehouse infrastructure that provides data summarization and ad hoc querying. • Mahout™: a scalable machine learning and data mining library. • Pig™: a high-level data-flow language and execution framework for parallel computation. • ZooKeeper™: a high-performance coordination service for distributed applications.
  14. HDFS. [Diagram: a 1 TB file split into four 250 GB chunks distributed across the DFS.] HDFS is based on GFS.
  15. HDFS: Use Cases. Good fit: • very large files • streaming data access: write once, read frequently • reading data in large volumes. Not required: • expensive hardware (HDFS runs on commodity machines). Poor fit: • low-latency access • lots of small files • parallel writes / arbitrary reads
  16. HDFS Building Blocks. Default block size: 64 MB (128 MB in newer releases). A 1 GB file = 1024 MB / 128 MB = 8 blocks. Small files: a 100 MB file < block size (128 MB) still occupies only one HDFS block, and that block stores just 100 MB on disk, not the full 128 MB.
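The block arithmetic above can be checked in the shell. The 1 GB example divides evenly; for files that are not an exact multiple of the block size, the count rounds up, which is what the ceiling division below expresses.

```shell
# Number of HDFS blocks = ceil(file size / block size), via integer ceiling division.
BLOCK_MB=128
FILE_MB=1024    # the 1 GB example from the slide
BLOCKS=$(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))
echo "${FILE_MB} MB file -> ${BLOCKS} blocks"          # 8 blocks

SMALL_MB=100    # the small-file example: one block, occupying only 100 MB
SMALL_BLOCKS=$(( (SMALL_MB + BLOCK_MB - 1) / BLOCK_MB ))
echo "${SMALL_MB} MB file -> ${SMALL_BLOCKS} block(s)" # 1 block
```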
  17. HDFS Daemon Services • Name Node • Secondary Name Node • Data Node. Like GFS, HDFS uses a master/slave architecture.
  18. HDFS Write. [Diagram: with a 128 MB block size and replication factor RF = 3, File 1's block is replicated on data nodes D1, D2, D4, and File 2's block on D1, D2, D3.]
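With RF = 3, every block written is stored three times across different data nodes, so the raw disk footprint is three times the logical size. A minimal sketch of that cost:

```shell
# Raw disk consumed per block = block size * replication factor.
BLOCK_MB=128
RF=3
RAW_MB=$(( BLOCK_MB * RF ))
echo "One ${BLOCK_MB} MB block with RF=${RF} occupies ${RAW_MB} MB of raw disk"
```

This tripling is the same factor that divides the usable cluster capacity on the earlier "big storage" slide.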
  19. 19. HDFS File System Commands
  20. 20. HDFS Federation
  21. 21. High Availability
  22. Copying data from one cluster to another (e.g. from a UAT cluster to a prod cluster): parallel copying using distcp: hadoop distcp hdfs://uat:54311/user/rajkrrsingh/input hdfs://prod:54311/user/rajkrrsingh/input